Randy Johnson
1/7/2019
Sand flea data from McDonald, J.H. 1989. Selection component analysis of the Mpi locus in the amphipod Platorchestia platensis. Heredity 62: 243-249.
See the Handbook of Biological Statistics for additional examples.
eggs = β0 + β1 mg + ε
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12.6890     4.2009   3.021   0.0056 **
## mg            1.6017     0.6176   2.593   0.0154 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eggs = 12.7 + 1.6*mg + ε
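The fitted model above can be reproduced with a call to lm(). The real data come from McDonald (1989); here we simulate stand-in data (the data frame and column names below are my own) so the snippet runs on its own:

```r
# Simulate stand-in data roughly matching the fitted coefficients above
set.seed(42)
sandflea <- data.frame(mg = runif(28, 2, 12))
sandflea$eggs <- 12.7 + 1.6 * sandflea$mg + rnorm(28, sd = 3)

# Fit eggs = b0 + b1 * mg + e
mdl <- lm(eggs ~ mg, data = sandflea)
summary(mdl)   # coefficient table like the one shown above
```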
Two ways to test the normality-of-residuals assumption:
##
## Shapiro-Wilk normality test
##
## data: .std.resid
## W = 0.93555, p-value = 0.08517
See my code and ?shapiro.test for more details.
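A minimal sketch of this check, using the built-in cars data as a stand-in for the sand flea model:

```r
# Shapiro-Wilk test on the standardized residuals of a fitted model
mdl <- lm(dist ~ speed, data = cars)   # stand-in data
shapiro.test(rstandard(mdl))
```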
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 16 3.425071 0.0021295 0.059625
See my code and ?outlierTest from the car package for more details.
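The same Bonferroni-adjusted outlier test can be sketched by hand in base R (again using the built-in cars data as a stand-in), which shows where the numbers in car::outlierTest() come from:

```r
mdl <- lm(dist ~ speed, data = cars)   # stand-in data
rs  <- rstudent(mdl)
i   <- unname(which.max(abs(rs)))
# two-sided p for the largest studentized residual; df = residual df - 1
p   <- 2 * pt(abs(rs[i]), df = df.residual(mdl) - 1, lower.tail = FALSE)
c(obs = i, rstudent = unname(rs[i]), bonferroni_p = min(1, p * nobs(mdl)))
```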
We see that our borderline-significant outlier (#16) has very little influence on the coefficients, but it does change the estimated error terms.
* The trend line is not changed much if we exclude this data point from the analysis.
* It does shift the line slightly up, which results in a slightly larger estimate for all data points.
The 24th data point is also highlighted here. It has a larger residual (not an outlier) and is on the edge of the distribution. This gives it increased leverage and influence.
See my code and ?influencePlot from the car package for more details.
## [1] 1 16
Our borderline outlier in the top-right corner is larger than we would expect, but it still falls just inside the confidence region within which we expect the error terms to lie.
See my code and ?qqPlot from the car package for more details.
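A base-R version of this check (the slides use car::qqPlot, which adds the confidence envelope) can be sketched with qqnorm:

```r
mdl <- lm(dist ~ speed, data = cars)   # stand-in data
q <- qqnorm(rstandard(mdl), plot.it = FALSE)
# points should hug the 45-degree line if the residuals are ~normal
cor(q$x, q$y)
```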
size = β0 + β1 length + ε
size = β0 + β1 length + β2 length² + ε
size = β0 + β1 length + ε
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.59974   19.53898   0.031    0.976
## length       0.02435    0.06316   0.386    0.706
size = β0 + β1 length + β2 length² + ε
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.634e+02  2.958e+02  -3.257  0.00625 **
## length       6.264e+00  1.913e+00   3.275  0.00603 **
## length2     -1.008e-02  3.088e-03  -3.263  0.00617 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
size = β0 + β1 length + ε
size = β0 + β1 length + β2 length² + ε
How might you analyze data with a non-linear relationship?
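One option, as above, is to add a polynomial term and test whether it improves the fit. This sketch uses simulated hump-shaped data (built from the coefficient estimates shown above, with noise of my choosing) as a stand-in for the size/length data:

```r
# Simulate hump-shaped data resembling the quadratic fit above
set.seed(1)
d <- data.frame(length = runif(40, 100, 500))
d$size <- -963 + 6.26 * d$length - 0.0101 * d$length^2 + rnorm(40, sd = 15)

lin  <- lm(size ~ length, data = d)
quad <- lm(size ~ length + I(length^2), data = d)  # or poly(length, 2)
anova(lin, quad)   # F-test: does the quadratic term improve the fit?
```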
## # A tibble: 569 x 31
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18.0 10.4 123. 1001 0.118
## 2 20.6 17.8 133. 1326 0.0847
## 3 19.7 21.2 130 1203 0.110
## 4 11.4 20.4 77.6 386. 0.142
## 5 20.3 14.3 135. 1297 0.100
## 6 12.4 15.7 82.6 477. 0.128
## 7 18.2 20.0 120. 1040 0.0946
## 8 13.7 20.8 90.2 578. 0.119
## 9 13 21.8 87.5 520. 0.127
## 10 12.5 24.0 84.0 476. 0.119
## # ... with 559 more rows, and 26 more variables: compactness_mean <dbl>,
## # concavity_mean <dbl>, `concave points_mean` <dbl>,
## # symmetry_mean <dbl>, fractal_dimension_mean <dbl>, radius_se <dbl>,
## # texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## # smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## # `concave points_se` <dbl>, symmetry_se <dbl>,
## # fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
## # perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
## # compactness_worst <dbl>, concavity_worst <dbl>, `concave
## # points_worst` <dbl>, symmetry_worst <dbl>,
## # fractal_dimension_worst <dbl>, metastatic <lgl>
Breast cancer data from the UCI Machine Learning Repository.
See this Kaggle notebook by Mike M. Lee for a more thorough exploration of this dataset.
# check for multicollinearity
glm(metastatic ~ radius_mean + perimeter_mean + area_mean + texture_mean +
      smoothness_mean + concavity_mean, data = bc) %>%
  vif()
##     radius_mean  perimeter_mean       area_mean    texture_mean
##      829.419705      873.476620       40.812258        1.176198
## smoothness_mean  concavity_mean
##        1.644838        6.850699
Options:
* Drop all but one of the correlated predictors (e.g., keep radius_mean; drop perimeter_mean and area_mean).
* Combine the correlated predictors into a single composite variable.
* Use a method that tolerates collinearity, such as PCA or ridge regression.
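To see where the VIF numbers come from, here is a hand-rolled base-R equivalent of car::vif() for a linear model (the vif_by_hand function and the toy data are mine): each predictor is regressed on the others, and VIF = 1 / (1 - R²).

```r
# VIF by hand: regress each predictor on all the others
vif_by_hand <- function(X) {
  v <- sapply(seq_along(X), function(j) {
    r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
    1 / (1 - r2)
  })
  setNames(v, names(X))
}

set.seed(6)
X <- data.frame(a = rnorm(100))
X$b <- X$a + rnorm(100, sd = 0.1)   # nearly collinear with a
X$c <- rnorm(100)                   # unrelated to a and b
round(vif_by_hand(X), 1)            # a and b inflated; c near 1
```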
Data on measles incidence in Baltimore were collected from published reports by various people and hosted by Ben Bolker at McMaster University.
lm(Incidence ~ dt, data = measles) %>%
  durbinWatsonTest()
##  lag Autocorrelation D-W Statistic p-value
##    1       0.7823128     0.4342107       0
##  Alternative hypothesis: rho != 0
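One common remedy for autocorrelated residuals is generalized least squares with an AR(1) error structure. A sketch on simulated data (my own stand-in, not the measles series), using the nlme package that ships with R:

```r
library(nlme)   # recommended package, included with base R

# Trend plus AR(1) noise, like a time series of incidence data
set.seed(7)
n <- 120
d <- data.frame(t = seq_len(n))
d$y <- 2 + 0.05 * d$t + as.numeric(arima.sim(list(ar = 0.8), n))

fit <- gls(y ~ t, data = d, correlation = corAR1(form = ~ t))
summary(fit)$tTable   # standard errors account for the autocorrelation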
These are partial census data for females in the Dobe area !Kung San, compiled from interviews conducted by Nancy Howell in the late 1960s.
## Warning in spreadLevelPlot.lm(mdl1):
## 7 negative fitted values removed
##
## Suggested power transformation: 1.039703
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.110857, Df = 1, p = 0.2919
How might you model heteroscedastic data?
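One answer is weighted least squares. This sketch simulates data whose noise grows with x (my own toy data) and weights observations by 1/x², which assumes the error SD is roughly proportional to x:

```r
# Heteroscedastic data: noise SD grows with x
set.seed(3)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5 * x)

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)   # downweight the noisy observations
cbind(ols = coef(ols), wls = coef(wls))   # similar slopes, different SEs
```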
Many of the problems we encounter in linear regression stem from model choice.
The two most common transformations:
Original model: y ~ β0 + β1x + ε
With power transformation: y ~ β0 + β1x² + ε
Original model: y ~ β0 + β1x + ε
With power transformation: y ~ β0 + β1√x + ε
Original model: y ~ β0 + β1x + ε
With log transformation: y ~ β0 + β1ln(x) + ε
Original model: y ~ β0 + β1x + ε
With log transformation: ln(y) ~ β0 + β1x + ε
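A sketch of the last case: log-transforming the response linearizes exponential growth with multiplicative error (the simulated data and coefficients here are my own illustration):

```r
# Exponential relationship with multiplicative error
set.seed(9)
x <- runif(80, 0, 5)
y <- exp(1 + 0.5 * x + rnorm(80, sd = 0.2))

fit <- lm(log(y) ~ x)   # linear on the log scale
coef(fit)               # intercept near 1, slope near 0.5
```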
| Assumption | Cause | Consequence | Diagnosis |
|---|---|---|---|
| Linear Relationship | Bad model | Inaccurate predictions | car::crPlots() |
| Multivariate Normality | Bad model / Noisy data | Incorrect statistics (p-values / CIs) | car::qqPlot(), shapiro.test(residuals) |
| No/Little Multicollinearity | Correlated variables | Unstable model coefficients | car::vif() |
| No Autocorrelation | Non-independent data | Inefficient estimators | car::durbinWatsonTest() |
| Homoscedasticity | Bad model / "Bad" data | Incorrect statistics (p-values / CIs) | car::ncvTest(), car::spreadLevelPlot() |
Statistics for Lunch Team
* Greg Alvord
* Eckart Bindewald
* Taina Immonen
* Brian Luke
* George Nelson
* Ravi Ravichandran
* Tom Schneider
See https://github.com/abcsFrederick/LinearRegression for code.